{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "09c090b1-80d5-4c93-80da-5adc3a58da5c",
   "metadata": {},
   "source": [
    "# Retrieaval Augmented Generation（RAG）\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5837ba9-6606-44c3-8016-c7ce6a16934b",
   "metadata": {},
   "source": [
    "## 1 \n",
    "```\n",
    "pip install langchain_core\n",
    "pip install langchain_com\n",
    "# pip install langchain_community==0.3.12\n",
    "# pip install chromadb \n",
    "# pip install tiktoken\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7ac926f-b12c-45a7-8a6e-3e8c874c3a6b",
   "metadata": {},
   "source": [
    "![rag1](../datas/rag1.png)\n",
    "![rag1](../datas/rag2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df9f00bc-85be-4a98-9fda-9f720f6f0375",
   "metadata": {},
   "source": [
    "## 2 Documents\n",
    "### 2.1 define"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ab193ae3-aecf-4ac4-8f59-bbcb6afea64b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.documents import Document"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "241c596e-a867-4dbe-ace2-280d33c57519",
   "metadata": {},
   "source": [
    "### 2.2 load file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "2735872d-f052-4e21-a2b4-7438b4ce56ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "#%pip install pypdf\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "file_path = \"../datas/History A Very Short Introduction.pdf\"\n",
    "loader = PyPDFLoader(file_path)\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "eab7cabb-b022-46cd-a43c-57869f4d679c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "152\n",
      "{\n",
      "    \"source\": \"../datas/History A Very Short Introduction.pdf\",\n",
      "    \"page\": 0\n",
      "}\n",
      "----------\n",
      "History: A Very Short Introduction\n",
      "‘A stimulating and provocative introduction to one of collective\n",
      "humanity’s most important quests – understanding t\n",
      "----------\n",
      "Chapter 1\n",
      "Questions about murder\n",
      "and history\n",
      "Here is a true story. In 1301 Guilhem de Rodes hurried down from his\n",
      "Pyrenean village of Tarascon to the \n",
      "----------\n",
      "the heretics had a spy within the monastery. This spy, the beguin said,\n",
      "was linked to the heretics through his brother, a member of the laity,\n",
      "and a f\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "print(len(docs))\n",
    "print(json.dumps(docs[0].metadata,indent=4))\n",
    "print(\"-\"*10)\n",
    "print(docs[0].page_content[:150])\n",
    "print(\"-\"*10)\n",
    "print(docs[12].page_content[:150])\n",
    "print(\"-\"*10)\n",
    "print(docs[13].page_content[:150])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a1b33c7-690c-49dd-a06c-3330c5785ed0",
   "metadata": {},
   "source": [
    "## 3 Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "e9ef4b66-9928-4e71-a8e1-55d6e93cf3fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "History: A Very Short Introduction\n",
      "‘A stimulating and provocative introduction to one of collective\n",
      "*metadata*\n",
      "{'source': '../datas/History A Very Short Introduction.pdf', 'page': 0, 'start_index': 0}\n",
      "---------------------\n",
      "humanity’s most important quests – understanding the past and its\n",
      "*metadata*\n",
      "{'source': '../datas/History A Very Short Introduction.pdf', 'page': 0, 'start_index': 100}\n",
      "---------------------\n",
      "relation to the present. A vivid mix of telling examples and clear-cut\n",
      "analysis.’\n",
      "*metadata*\n",
      "{'source': '../datas/History A Very Short Introduction.pdf', 'page': 0, 'start_index': 166}\n",
      "---------------------\n"
     ]
    }
   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=100, chunk_overlap=20, add_start_index=True\n",
    ")\n",
    "all_splits = text_splitter.split_documents(docs)\n",
    "\n",
    "for doc in all_splits[:3]:\n",
    "    print(doc.page_content)\n",
    "    print(\"*metadata*\")\n",
    "    print(doc.metadata)\n",
    "    print(\"---------------------\")"
   ]
  },
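  {
   "cell_type": "markdown",
   "id": "3a7d2c10-split-overlap-note",
   "metadata": {},
   "source": [
    "To make `chunk_size` and `chunk_overlap` concrete, the next cell is a minimal sketch of fixed-width chunking with overlap. It is a simplification for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` and the sample string are hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a7d2c10-split-overlap-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:\n",
    "    \"\"\"Return (start_index, chunk) pairs; consecutive chunks share chunk_overlap characters.\"\"\"\n",
    "    if not 0 <= chunk_overlap < chunk_size:\n",
    "        raise ValueError(\"chunk_overlap must be >= 0 and smaller than chunk_size\")\n",
    "    step = chunk_size - chunk_overlap  # how far the window advances each time\n",
    "    chunks = []\n",
    "    for start in range(0, len(text), step):\n",
    "        chunks.append((start, text[start:start + chunk_size]))\n",
    "        if start + chunk_size >= len(text):\n",
    "            break  # this window already reached the end of the text\n",
    "    return chunks\n",
    "\n",
    "\n",
    "# Hypothetical sample text; each printed line is (start_index, chunk).\n",
    "sample = \"abcdefghijklmnopqrstuvwxyz\"\n",
    "for start, chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):\n",
    "    print(start, chunk)"
   ]
  },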
  {
   "cell_type": "markdown",
   "id": "9e8d1a55-b9a1-483a-9f08-c5c2c0695e54",
   "metadata": {},
   "source": [
    "## 4 Embedding\n",
    " it is defficult to find a free embedding model access from web.\n",
    " visit *huggingface.co* and search by BAAI which mean Beijing Academica of AI.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "b97470e2-80ff-4f88-b7b7-797c1a73aa21",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated vectors of length 768\n",
      "\n",
      "[-0.4221174120903015, 0.35957491397857666, -4.204888343811035, -1.03995943069458, -0.7558143138885498, 1.4048606157302856, -1.3307297229766846, 0.14616094529628754, -0.19596339762210846, -0.2266235053539276]\n",
      "[-0.5790578126907349, 1.1685622930526733, -3.774078369140625, -0.40715155005455017, -0.5284842252731323, 0.22703196108341217, -1.2266535758972168, 0.40388086438179016, -0.9309250116348267, -0.44798630475997925]\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.embeddings import OllamaEmbeddings\n",
    "embeddings = OllamaEmbeddings(model='nomic-embed-text', base_url=\"http://192.168.99.142:11434\")\n",
    "#embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-small-en-v1.5\")\n",
    "#embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-large-zh-v1.5\")\n",
    "\n",
    "vector_1 = embeddings.embed_query(\"queen and king\")\n",
    "vector_2 = embeddings.embed_query(\"king\")\n",
    "\n",
    "assert len(vector_1) == len(vector_2)\n",
    "print(f\"Generated vectors of length {len(vector_1)}\\n\")\n",
    "print(vector_1[:10])\n",
    "print(vector_2[:10])"
   ]
  },
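  {
   "cell_type": "markdown",
   "id": "5b8e4f21-cosine-note",
   "metadata": {},
   "source": [
    "Embedding vectors are usually compared by cosine similarity. The next cell is a minimal pure-Python sketch; the helper name `cosine_similarity` and the toy vectors are assumptions for illustration, not part of LangChain. Passing `vector_1` and `vector_2` from the cell above to the same helper would score how close the two queries are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b8e4f21-cosine-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def cosine_similarity(a: list[float], b: list[float]) -> float:\n",
    "    \"\"\"Cosine of the angle between two equal-length vectors (1.0 means same direction).\"\"\"\n",
    "    if len(a) != len(b):\n",
    "        raise ValueError(\"vectors must have the same length\")\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    if norm_a == 0.0 or norm_b == 0.0:\n",
    "        raise ValueError(\"cosine similarity is undefined for a zero vector\")\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "\n",
    "# Toy vectors (not real embeddings): same direction, orthogonal, opposite.\n",
    "print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))\n",
    "print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))\n",
    "print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))"
   ]
  },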
  {
   "cell_type": "markdown",
   "id": "194aa1dc-9967-4de8-a7d0-24919991f30e",
   "metadata": {},
   "source": [
    "# 5 VectorStore\n",
    "\n",
    "LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations).Vectorstores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "423310df-6f72-4ee6-a1cf-de2f52c39a19",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['1f378a3c-1f26-4103-9626-6ef5d12f5abf',\n",
       " 'f0b9b13f-6e64-432f-b430-891579002a67']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_core.vectorstores import InMemoryVectorStore\n",
    "docs = [\n",
    "    Document(\n",
    "        page_content=\"Dogs are great companions, known for their loyalty and friendliness.\",\n",
    "        metadata={\"source\": \"mammal-pets-doc\"},\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Cats are independent pets that often enjoy their own space.\",\n",
    "        metadata={\"source\": \"mammal-pets-doc\"},\n",
    "    ),\n",
    "]\n",
    "\n",
    "vector_store = InMemoryVectorStore(embeddings)\n",
    "vector_store.add_documents(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "e7acbb44-83fd-4f95-bcd0-a7f6975109d1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Dogs are great companions, known for their loyalty and friendliness.' metadata={'source': 'mammal-pets-doc'}\n"
     ]
    }
   ],
   "source": [
    "results = vector_store.similarity_search(\n",
    "    \"dogs?\"\n",
    ")\n",
    "print(results[0])\n",
    "\n",
    "#vector_store.similarity_search_with_score(\"dogs\")\n",
    "#doc, score = results[0]\n",
    "#print(f\"Score: {score}\\n\")\n",
    "#print(doc)\n",
    "#\n",
    "#embedding = embeddings.embed_query(\"dogs\")\n",
    "#results = vector_store.similarity_search_by_vector(embedding)\n",
    "#print(results[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "341f834c-92d1-4826-88b9-5d3fd13e5218",
   "metadata": {},
   "source": [
    "## 6 Retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "34ac4889-3f2d-4e27-8ba0-fbafba505a62",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(id='1f378a3c-1f26-4103-9626-6ef5d12f5abf', metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')]\n"
     ]
    }
   ],
   "source": [
    "retriever = vector_store.as_retriever(\n",
    "    search_type=\"similarity\",\n",
    "    search_kwargs={\"k\": 1},\n",
    ")\n",
    "\n",
    "doc = retriever.invoke(\"dogs\")\n",
    "print(doc)"
   ]
  },
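  {
   "cell_type": "markdown",
   "id": "7c9a6d32-toy-retriever-note",
   "metadata": {},
   "source": [
    "The relationship between a vector store and a retriever can be sketched in plain Python. This is a toy illustration under stated assumptions, not LangChain's implementation: `toy_embed`, `ToyVectorStore`, `ToyRetriever`, and their simplified signatures are hypothetical. Only the method names `similarity_search`, `as_retriever`, `invoke`, and `batch` mirror the calls discussed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c9a6d32-toy-retriever-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def toy_embed(text: str) -> list[float]:\n",
    "    \"\"\"Hypothetical 2-D 'embedding': counts of the substrings 'dog' and 'cat'.\"\"\"\n",
    "    lowered = text.lower()\n",
    "    return [float(lowered.count(\"dog\")), float(lowered.count(\"cat\"))]\n",
    "\n",
    "\n",
    "def cosine(a: list[float], b: list[float]) -> float:\n",
    "    \"\"\"Cosine similarity; returns 0.0 when either vector is all zeros.\"\"\"\n",
    "    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n",
    "    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0\n",
    "\n",
    "\n",
    "class ToyVectorStore:\n",
    "    \"\"\"Keeps (text, vector) pairs in a list and ranks them by cosine similarity.\"\"\"\n",
    "\n",
    "    def __init__(self) -> None:\n",
    "        self._items: list[tuple[str, list[float]]] = []\n",
    "\n",
    "    def add_texts(self, texts: list[str]) -> None:\n",
    "        self._items.extend((text, toy_embed(text)) for text in texts)\n",
    "\n",
    "    def similarity_search(self, query: str, k: int = 4) -> list[str]:\n",
    "        query_vector = toy_embed(query)\n",
    "        ranked = sorted(self._items, key=lambda item: cosine(query_vector, item[1]), reverse=True)\n",
    "        return [text for text, _ in ranked[:k]]\n",
    "\n",
    "    def as_retriever(self, k: int = 1) -> \"ToyRetriever\":\n",
    "        return ToyRetriever(self, k)\n",
    "\n",
    "\n",
    "class ToyRetriever:\n",
    "    \"\"\"Wraps the store behind a uniform invoke/batch interface.\"\"\"\n",
    "\n",
    "    def __init__(self, store: ToyVectorStore, k: int) -> None:\n",
    "        self._store = store\n",
    "        self._k = k\n",
    "\n",
    "    def invoke(self, query: str) -> list[str]:\n",
    "        return self._store.similarity_search(query, k=self._k)\n",
    "\n",
    "    def batch(self, queries: list[str]) -> list[list[str]]:\n",
    "        return [self.invoke(query) for query in queries]\n",
    "\n",
    "\n",
    "store = ToyVectorStore()\n",
    "store.add_texts([\"Dogs are great companions.\", \"Cats are independent pets.\"])\n",
    "toy_retriever = store.as_retriever(k=1)\n",
    "print(toy_retriever.invoke(\"dogs\"))\n",
    "print(toy_retriever.batch([\"dogs\", \"cats\"]))"
   ]
  },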
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11807c42-2e2e-49f3-b402-c2d839fc2f88",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
